1 Introduction

Hi, welcome to this fourth lesson, on advanced data exploration.

In the previous lessons, you have explored you first dataset with aggregate data at the geographical region level, and created different types of charts that you combined into your first dashboard.

In this lesson, you will explore a new dataset, with raw individual data, create charts to visualize quantative data distributions and the relation between categorical variables, and aggregate your individual data by different columns in the dataset.

2 Learning Objectives

By the end of this lesson,

3 Dataset selection

First, log into the EpiGraphHub platform and go to the Datasets page. There, type positive_cases_covid_d in the SEARCH field, and click on the dataset with this name (the dataset highlighted below).

The following page will open:

In this dataset, we will work with individual data on positive cases of COVID-19 in Colombia. To do some initial exploration, we will select the RAW RECORDS QUERY MODE and the following columns of the dataset:

Below is the meaning of these columns:

After selecting these columns and clicking on the RUN QUERY button, that’s the result:

Now, let’s do some further exploration with these columns.

Remember that to see the result of a query as a table you need to set VISUALISATION TYPE to Table.

4 Exploration

In this section, let’s learn how to make simple operations that will help you to better understand your data.

4.1 Null values

To discover how many NULL VALUES there are in a column, select the AGGREGATE QUERY MODE, and add the metric COUNT(*) that will count the number of rows returned:

Also, in the FILTERS field, add your column of interest, with the operator IS NULL.

For example, to discover the number of null values in the column fecha_inicio_sintomas you must use the filter:

And that’s the result:

This column has 489’336 null values.

To get the total number of rows in the table (6.1 million), run this query (using the metrics COUNT(*)) without applying any filter.

To remove the NULL VALUES in a specific column, e.g., fecha_inicio_sintomas, for the remaining of the exploration, go to the FILTERS section and apply the following filter:

And click on the SAVE button.

4.2 Unique values

To see the number of unique values in a column, stay on the AGGREGATE QUERY MODE. For example, to see the unique values in the sexo column and count the number of covid-19 cases per sexo value, select this column in the GROUP BY field and add COUNT(sexo) or COUNT(*) in the METRICS field below.

You will then see these fields filled on your screen:

After clicking on the RUN QUERY button, this will be the result:

In the region highlighted above, you can see the exact number of F (females) and M (males) observations.

4.3 Histogram

The histogram can be used to provide the distribution of our data. Using the positive_cases_covid_d, let’s plot a histogram of the age of dead individuals.

In this case, we will use the column edad (that refers to the age of the individual), the column unidad_medida (to ensure that the value in edad represents years), and the column estado to filter by the Fallecido (dead individuals).

To do this, click on the Table value to change the VISUALISATION TYPE:

In the window that will open, type histogram:

Click on the histogram and on the SELECT button.

In the Explorae Page, select the column edad in the COLUMNS field and apply the two filters presented below:

So, your final query configuration is:

After clicking on the RUN QUERY button, this will be the displayed result:

To change the number of bins in the histogram, go to the CUSTOMIZE tab:

and change the NO OF BINS field:

After setting NO OF BINS to 20, you will get the result below:

If you hover the mouse over the bars, you get the information about what range of the edad (age) the bar represents, what is the exact count, cumulative, and percentile that this bar represents:

4.4 Box plot

The boxplots can be used to summarise the statistics of the values in a column.

Analogous to what was done in the previous section, using the positive_cases_covid_d dataset, we will plot a boxplot of the age of dead individuals by department.

In this case, we will use the columns:

  1. edad that refers to the age of the individual,
  2. unidad_medida to ensure that the value in edad represents years, and
  3. estado to filter by the Fallecido (dead individuals).

To do this, on the Exploration page, click on the Table value to change the VISUALIZATION TYPE:

In the window that will open, type box plot:

Click on the Box Plot and on the SELECT button.

For this chart, the METRICS field is mandatory. As we are interested in the age distribution, let’s start by setting the METRICS field to AVG(edad) in the Query section.

And RUN QUERY. You should then get the following result, with only one box plot representing all age values in our dataset:

By default, the temporal column defined in TIME COLUMN and the TIME GRAIN are used to compute the value in METRICS.

In the result above, it means that :

  1. all patients’ records were first grouped by Day based on the fecha_reporte_webcolumn, then
  2. the average ages (METRICS = AVG(edad)) by day were computed, and eventually
  3. the distribution of these average ages was plotted

To convince yourself about the explanation above, change the operator in the METRICS field from AVG to SUM, and see how it impacts the plotted distribution.

Now, to see the distribution of the age of individuals, without prior aggregation by time, we will fill the field DISTRIBUTE ACROSS with the id_de_caso column, representing the unique case ID of each record in our dataset . This will ensure that the box plot will use the age of each individual patient.

As a consequence of using DISTRIBUTE ACROSS = id_de_caso, selecting the AGGREGATE operator in METRICS to be SUM or AVG will not change the result distribution. Let’s then keep METRICS as AVG(edad).

To get the age in years only for the dead individuals, let’s apply the two filters presented below:

In the SERIES field, we will select the column with unique values to be shown along the X axis. For each unique value in this column, a box plot will be computed.

To get one box plot for each department in Colombia, let’s set SERIES to the column departamento_nom. So, your final query configuration is the following:

After clicking on the RUN QUERY button, this will be the received result:

To see the name of the departamento_nom associated with each box plot, we can rotate the X axis labels. To do this, go to the CUSTOMISE tab:

and change the X TICK LAYOUT field to 90º:

and that’s the result:

If you hover the mouse over a box plot, you get the information about the quartiles, observation, and outliers:

You can change the time range considered in the region below:

And in the field below, you can change the type of box plot:

By default, it uses Tukey, where the min and max values are at most 1.5 times the IQR (interquartile range) from the first quartile (25 percentile) and third quartile (75 percentile), respectively. The other available options are:

  • Min/max (no outliers);
  • 2/98 percentiles;
  • 9/91 percentiles.

4.5 Pivot table

A pivot table is a good visualisation to see the relation of discrete values in two different columns of a dataset. Let’s use this visualisation to see the total number of cases related to the values in the column estado that represent the severity of the COVID-19 case and the column ubicacion that represents where the covid-19 case was treated.

Below is the meaning of the values that will appear in the visualisation:

  • Casa: It means that the individual was treated at home;
  • Fallecido: It means that the individual died from covid-19;
  • Leve: It means that the individual had mild covid-19 symptoms;
  • Moderado: It means that the individual had moderate covid-19 symptoms;
  • Grave: It means that the individual had severe covid-19 symptoms.

To do a pivot table, click on the Table value to change the VISUALIZATION TYPE:

In the window that will open, type Pivot table:

Click on the Pivot Table v2 and on the SELECT button.

We need to select a column to be used as columns of the pivot table and another to be used as rows of the pivot table. In this example, let’s use ubicacion in the field COLUMNS and estado in the field ROWS.

To see the number of cases grouped by the values in the columns and rows of the pivot table let’s use as metrics COUNT(*), to count all the records returned in this aggregation.

So, the final query is:

To avoid the columns with values not informed, let’s create 2 different filters. The first is to ignore the null values and the second is to ignore the values equal to “N/A” that represent the individuals that died but not by covid.

The <> represents the not equals operator.

Then click on the RUN button, and the result will be:

In the Options section selected below you have the following possibilities:

If you select the first two items (SHOW ROWS TOTAL, and SHOW COLUMNS TOTAL), that is the result:

Furthermore, it’s possible to change the aggregation function used to compute the total values in the rows in the columns by changing the value in the select box. The default value is Sum.

It’s also possible to transpose the table, by selecting the TRANSPOSE PIVOT and put different metrics side by side (if you add more than one metric in the METRICS field) by selecting the COMBINE METRICS.

4.6 Other aggregations

As mentioned before, the positive_cases_covid_d dataset represents individual data. So, we will use the AGGREGATE QUERY MODE to get the aggregate number of cases according to some conditions such as date of first symptoms, sex, department, and disease severity.

To do this, we will use the Time-series Bar Chart v2 VISUALISATION TYPE. When we select this visualisation, we have the following fields to fill:

We need to select the columns that we want to aggregate in the GROUP BY field. For the conditions listed above, we will add the columns sexo and estado in the GROUP BY field.

We also need to select a TIME COLUMN, for this visualisation. So, let’s use the fecha_inicio_sintomas to aggregate the data by the date of the first symptoms. Remember that you can fill this field with any other time column. Let’s also change the TIME GRAIN to Week and select the data in the first quarter of 2022.

To count the total number of observations aggregated by the TIME COLUMN and columns in the GROUP BY field let’s use the metrics COUNT(*).

In the FILTERS field, let’s add a filter to ignore the values N/A in the column estado, which represent the individuals that died, but not as a consequence of covid-19.

So, the final query configuration is:

After running this query (clicking on the RUN QUERY button), that is the result:

To improve the chart, let’s go to the CUSTOMIZE tab and select the checkboxes STACK SERIES and SHOW LEGEND:

So, after changing the title of the plot to COVID-19 cases in Colombia - Severity by sex over time, the result is:

Note that with this configuration, the chart is much more informative. If you hover the mouse over a bar you can see the exact value of cases according to each group.

You also can visualise the result of this aggregation in a tabular way by looking at the Data section:

Remember that you can download this data by clicking on the .csv or .json icon above the chart.

In this example, we use the data available for all the departments in Colombia. Practice by :

  • filtering the result for a specific department by adding a filter on the column departamento_nom, e.g. departamento_nom = ‘BOGOTA’.
  • selecting specific values for the columns sexo (e.g. sexo = ‘M’) and estado (e.g. estado IN (‘Moderado’,’Leve’)).

Congratulations! You reached the end of this fourth lesson on advanced data exploration.

Hopefully, you are now feel comfortable with the exploration of datasets with raw individual data and their aggregation. By now, you should also know how to inspect the values of different columns in your dataset, and visualise the distribution of quantitative values and the relationship between categorical variables.

In the next lesson you will go further into dashboard creation and learn how to enrich your dashboards. See you there :)

Contributors

The following team members contributed to this lesson:

This work is licensed under the Creative Commons Attribution Share Alike license. Creative Commons License